Non-intrusive Quality Assessment of Synthesized Speech using Spectral Features and Support Vector Regression
Authors
Abstract
In this paper, we propose a new quality assessment method for synthesized speech. Unlike the previous approach, which uses a Hidden Markov Model (HMM) trained on natural utterances as a reference model to predict the quality of synthesized speech, the proposed approach uses knowledge about synthesized speech while training the model. The previous approach has been applied successfully to quality assessment of synthesized speech for the German language. However, it gave poor results for English-language databases such as the Blizzard Challenge 2008 and 2009 databases. We pose quality assessment of synthesized speech as a regression problem: the mapping between statistical properties of spectral features extracted from the speech signal and the corresponding speech quality score (mean opinion score, MOS) is learned using Support Vector Regression (SVR). All experiments were carried out on the Blizzard Challenge databases of 2008, 2009, 2010 and 2012. The results show that including knowledge about synthesized speech during training improves the performance of the quality assessment system. Moreover, the accuracy of the quality assessment system depends heavily on the kind of synthesis system used for signal generation. On the Blizzard 2008 and 2009 databases, the proposed approach gives correlations of 0.28 and 0.49, respectively, with about 17% of the data used in training. The previous approach gives correlations of 0.3 and 0.09, respectively, using spectral features. For the Blizzard 2012 database, the proposed approach gives a correlation of 0.8 using 12% of the available data in training.
Index Terms: Quality assessment, autoencoder, subband.
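The regression setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The per-utterance statistics (mean and standard deviation over frames), the 13-dimensional features, the RBF kernel, the hyperparameters, and the synthetic corpus are all assumptions made for illustration, and scikit-learn is used only as a convenient SVR implementation. The correlation is computed as a Pearson coefficient, which the abstract does not specify.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def utterance_stats(frames):
    """Collapse a (n_frames, n_dims) feature matrix into one fixed-length vector."""
    # Assumption: mean and standard deviation stand in for the
    # "statistical properties" mentioned in the abstract.
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])


def make_toy_corpus(n_utts=200, n_dims=13, seed=0):
    """Synthetic stand-in for a listening-test corpus (not Blizzard data)."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for _ in range(n_utts):
        quality = rng.uniform(1.0, 5.0)  # hypothetical MOS on a 1-5 scale
        n_frames = int(rng.integers(80, 200))
        # Toy assumption: lower quality means noisier spectral frames.
        frames = rng.normal(0.0, 6.0 - quality, size=(n_frames, n_dims))
        X.append(utterance_stats(frames))
        y.append(quality)
    return np.array(X), np.array(y)


X, y = make_toy_corpus()
n_train = int(0.17 * len(y))  # small training share, echoing the ~17% in the abstract

# Standardize features, then fit an epsilon-insensitive SVR (hyperparameters are guesses).
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
model.fit(X[:n_train], y[:n_train])

pred = model.predict(X[n_train:])
corr = np.corrcoef(pred, y[n_train:])[0, 1]
print(f"Pearson correlation on held-out utterances: {corr:.2f}")
```

On real data, `make_toy_corpus` would be replaced by feature extraction from the synthesized utterances and by the MOS values collected in the listening tests. The correlation on this toy corpus says nothing about the figures reported in the abstract.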
Similar resources
A Bayesian approach to non-intrusive quality assessment of speech
A Bayesian approach to non-intrusive quality assessment of narrow-band speech is presented. The speech features used to assess quality are the sample mean and variance of bandpowers evaluated from the temporal envelope in the channels of an auditory filter-bank. Bayesian multivariate adaptive regression splines (BMARS) is used to map features into quality ratings. The proposed combination of fe...
Novel Subband Autoencoder Features for Non-Intrusive Quality Assessment of Noise Suppressed Speech
In this paper, we propose a novel feature extraction architecture of Deep Neural Network (DNN), namely, subband autoencoder (SBAE). The proposed architecture is inspired by the Human Auditory System (HAS) and extracts features from speech spectrum in an unsupervised manner. We have used features extracted by this architecture for non-intrusive objective quality assessment of noise suppressed sp...
Improving of Feature Selection in Speech Emotion Recognition Based-on Hybrid Evolutionary Algorithms
One of the important issues in speech emotion recognition is selecting appropriate feature sets in order to improve the detection rate and classification accuracy. In earlier studies, researchers tried to select appropriate features for classification using feature selection and feature-space reduction methods, such as Fisher and PCA. In this research, a hybrid evolutionary algorit...
Non-intrusive Estimation of Noisiness as a Perceptual Quality Dimension of Transmitted Speech
This article presents a new approach to the non-intrusive quality estimation of transmitted speech. Traditional estimation methods are limited in the diagnostic information they provide and in their suitability for practical monitoring. The new approach merges solutions to overcome the existing limitations and intends to provide a new user-friendly estimator. We present an overview and the planned structu...
Comparison of Spectral Methods for Spoken Language Identification
Automatic spoken language identification is the task of identifying a language from the speech signal. Language identification systems can be divided into two categories: spectral-based methods and phonetic-based methods. In the former, short-time characteristics of the speech spectrum are extracted as a multi-dimensional vector. The statistical model of these features is then obtained for each language. The ...